Week 3: EDA and Data Visualization
EDA is the process of summarizing and visualizing data to understand its main features.
It should be the first step in your analysis pipeline, and it serves to surface problems, patterns, and surprises in the data before any formal modeling.
EDA checklist: formulate a question, read in the data, check the dimensions and variable types, summarize the variables, look at missing values and outliers, and validate findings against external sources.
We are going to use a dataset created from NOAA’s National Centers for Environmental Information (NCEI). The data are hourly measurements from weather stations across the continental U.S.
It is a good idea to first have a question such as:
There are several ways to read in data (some depend on the type of data you have):
We’ll use the Python packages pandas, numpy, matplotlib, seaborn, and plotnine.
If your data are a gzipped delimited file (e.g., met_all.gz), pandas can read it directly.
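For example (a minimal, self-contained sketch: it writes a tiny gzipped CSV as a stand-in for met_all.gz, then reads it back; with the real file you would just pass its path):

```python
import gzip
import os
import tempfile

import pandas as pd

# Create a tiny gzipped CSV as a stand-in for met_all.gz (hypothetical data)
path = os.path.join(tempfile.gettempdir(), "met_demo.gz")
with gzip.open(path, "wt") as f:
    f.write("USAFID,temp\n690150,37.2\n690150,35.6\n")

# pandas detects the .gz extension and decompresses transparently
met = pd.read_csv(path)
print(met.shape)
```

The same pd.read_csv call also accepts a URL, so a raw GitHub link to the .gz file works the same way.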
| USAFID | WBAN | year | month | day | hour | min | lat | lon | elev | ... | vis.dist.qc | vis.var | vis.var.qc | temp | temp.qc | dew.point | dew.point.qc | atm.press | atm.press.qc | rh | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 690150 | 93121 | 2019 | 8 | 1 | 0 | 56 | 34.3 | -116.166 | 696 | ... | 5 | N | 5 | 37.2 | 5 | 10.6 | 5 | 1009.9 | 5 | 19.881266 |
| 1 | 690150 | 93121 | 2019 | 8 | 1 | 1 | 56 | 34.3 | -116.166 | 696 | ... | 5 | N | 5 | 35.6 | 5 | 10.6 | 5 | 1010.3 | 5 | 21.760980 |
| 2 | 690150 | 93121 | 2019 | 8 | 1 | 2 | 56 | 34.3 | -116.166 | 696 | ... | 5 | N | 5 | 34.4 | 5 | 7.2 | 5 | 1010.6 | 5 | 18.482122 |
| 3 | 690150 | 93121 | 2019 | 8 | 1 | 3 | 56 | 34.3 | -116.166 | 696 | ... | 5 | N | 5 | 33.3 | 5 | 5.0 | 5 | 1011.6 | 5 | 16.888622 |
4 rows × 30 columns
If your data are a gzipped delimited file stored on GitHub, pass the raw file URL to pd.read_csv.
| USAFID | WBAN | year | month | day | hour | min | lat | lon | elev | ... | vis.dist.qc | vis.var | vis.var.qc | temp | temp.qc | dew.point | dew.point.qc | atm.press | atm.press.qc | rh | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 690150 | 93121 | 2019 | 8 | 1 | 0 | 56 | 34.3 | -116.166 | 696 | ... | 5 | N | 5 | 37.2 | 5 | 10.6 | 5 | 1009.9 | 5 | 19.881266 |
| 1 | 690150 | 93121 | 2019 | 8 | 1 | 1 | 56 | 34.3 | -116.166 | 696 | ... | 5 | N | 5 | 35.6 | 5 | 10.6 | 5 | 1010.3 | 5 | 21.760980 |
| 2 | 690150 | 93121 | 2019 | 8 | 1 | 2 | 56 | 34.3 | -116.166 | 696 | ... | 5 | N | 5 | 34.4 | 5 | 7.2 | 5 | 1010.6 | 5 | 18.482122 |
| 3 | 690150 | 93121 | 2019 | 8 | 1 | 3 | 56 | 34.3 | -116.166 | 696 | ... | 5 | N | 5 | 33.3 | 5 | 5.0 | 5 | 1011.6 | 5 | 16.888622 |
4 rows × 30 columns
We can get the types of data (integer, float, character) from met.info(). This also tells us the memory usage.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2377343 entries, 0 to 2377342
Data columns (total 30 columns):
# Column Dtype
--- ------ -----
0 USAFID int64
1 WBAN int64
2 year int64
3 month int64
4 day int64
5 hour int64
6 min int64
7 lat float64
8 lon float64
9 elev int64
10 wind.dir float64
11 wind.dir.qc object
12 wind.type.code object
13 wind.sp float64
14 wind.sp.qc object
15 ceiling.ht float64
16 ceiling.ht.qc int64
17 ceiling.ht.method object
18 sky.cond object
19 vis.dist float64
20 vis.dist.qc object
21 vis.var object
22 vis.var.qc object
23 temp float64
24 temp.qc object
25 dew.point float64
26 dew.point.qc object
27 atm.press float64
28 atm.press.qc int64
29 rh float64
dtypes: float64(10), int64(10), object(10)
memory usage: 544.1+ MB
On laptops, “very large” usually means it stresses RAM and slows basic operations.
A dataset can be “very large” even with only a few million rows if it has many columns or string columns (object dtype). Think in terms of memory, not just rows × columns.
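One way to see where the memory goes is memory_usage(deep=True), which counts the actual bytes inside object (string) columns. A sketch on a small stand-in frame:

```python
import pandas as pd

# Small stand-in frame: one numeric column and one string (object) column
df = pd.DataFrame({"temp": [37.2, 35.6, 34.4],
                   "temp.qc": ["5", "5", "5"]})

# Per-column memory in MB, largest first; deep=True measures object contents
mem_mb = df.memory_usage(deep=True) / 1024**2
print(mem_mb.sort_values(ascending=False))
```

Object columns dominate: each Python string carries object overhead far beyond the characters it stores.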
The columns using the most memory (MB):

dew.point.qc 113.360548
temp.qc 113.360548
vis.var 113.360548
wind.type.code 113.360548
sky.cond 113.360548
ceiling.ht.method 113.360548
vis.dist.qc 103.298048
vis.var.qc 102.860548
wind.sp.qc 99.797945
wind.dir.qc 85.994595
dtype: float64
dtypes tells us what kind of variables we have in a dataset.

USAFID int64
WBAN int64
year int64
month int64
day int64
hour int64
min int64
lat float64
lon float64
elev int64
wind.dir float64
wind.dir.qc object
wind.type.code object
wind.sp float64
wind.sp.qc object
ceiling.ht float64
ceiling.ht.qc int64
ceiling.ht.method object
sky.cond object
vis.dist float64
vis.dist.qc object
vis.var object
vis.var.qc object
temp float64
temp.qc object
dew.point float64
dew.point.qc object
atm.press float64
atm.press.qc int64
rh float64
dtype: object
If the file is really big (ours is borderline), we could read only the columns we need. This is where having a question in mind is useful. For ours, we probably just want station location (lat, lon), year, month, day, hour, temperature and humidity.
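The usecols argument does this at parse time, so the unwanted columns never occupy memory. A sketch on an in-memory stand-in (with the real file you would pass the path to met_all.gz instead):

```python
import io

import pandas as pd

# Stand-in CSV with a column we don't need (wind.sp)
csv = io.StringIO("USAFID,year,temp,rh,wind.sp\n690150,2019,37.2,19.9,2.1\n")

# usecols keeps only the requested columns while parsing, saving memory
met = pd.read_csv(csv, usecols=["USAFID", "year", "temp", "rh"])
print(list(met.columns))
```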
| USAFID | year | month | day | hour | lat | lon | temp | rh | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 690150 | 2019 | 8 | 1 | 0 | 34.3 | -116.166 | 37.2 | 19.881266 |
| 1 | 690150 | 2019 | 8 | 1 | 1 | 34.3 | -116.166 | 35.6 | 21.760980 |
| 2 | 690150 | 2019 | 8 | 1 | 2 | 34.3 | -116.166 | 34.4 | 18.482122 |
| 3 | 690150 | 2019 | 8 | 1 | 3 | 34.3 | -116.166 | 33.3 | 16.888622 |
Chunking in pandas means reading in a file in pieces so you don’t load the whole dataset into RAM at once.
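A minimal sketch of chunked reading, shown here on an in-memory stand-in (with the real data you would pass the .gz path and a much larger chunksize):

```python
import io

import pandas as pd

# Stand-in for a big file; only one chunk is ever in RAM at a time
csv = io.StringIO("temp\n20.0\n22.0\n24.0\n26.0\n")

total, n = 0.0, 0
# chunksize makes read_csv return an iterator of DataFrames
for chunk in pd.read_csv(csv, chunksize=2):
    total += chunk["temp"].sum()    # sum() skips NaN
    n += chunk["temp"].count()      # count() excludes NaN
mean_temp = total / n
print(mean_temp)  # 23.0
```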
np.float64(23.590542469664527)
This computes the overall mean temperature without ever storing the full dataset:
We’re basically doing: \(\bar{x} = \frac{\sum x_i}{\#\{x_i \text{ not missing}\}}\)
Once we have loaded the data, we should check the dimensions of the dataset. In pandas, we use met.shape.
We see that there are 2,377,343 records of hourly weather measurements in August 2019 from weather stations across the U.S. The dataset has 30 variables.
We know that we are supposed to have hourly measurements of weather data for the month of August 2019 for the entire US. We should check that we have all of these components. Let us check:
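A few quick completeness checks (sketched on a toy frame; with the real met we would expect 31 distinct days and 24 distinct hours):

```python
import pandas as pd

# Toy stand-in for met; hypothetical values, not the real data
met = pd.DataFrame({"day": [1, 1, 2], "hour": [0, 12, 23]})

print(met["day"].nunique())              # should be 31 for a complete August
print(met["hour"].nunique())             # should be 24 for full hourly coverage
print(met["day"].min(), met["day"].max())
```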
We can get a quick summary from met.describe().
| USAFID | WBAN | year | month | day | hour | min | lat | lon | elev | wind.dir | wind.sp | ceiling.ht | ceiling.ht.qc | vis.dist | temp | dew.point | atm.press | atm.press.qc | rh | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2.377343e+06 | 2.377343e+06 | 2377343.0 | 2377343.0 | 2.377343e+06 | 2.377343e+06 | 2.377343e+06 | 2.377343e+06 | 2.377343e+06 | 2.377343e+06 | 1.592053e+06 | 2.297650e+06 | 2.256068e+06 | 2.377343e+06 | 2.296387e+06 | 2.317254e+06 | 2.311055e+06 | 711069.000000 | 2.377343e+06 | 2.310917e+06 |
| mean | 7.230995e+05 | 2.953885e+04 | 2019.0 | 8.0 | 1.599589e+01 | 1.134106e+01 | 3.956097e+01 | 3.794132e+01 | -9.214509e+01 | 4.158152e+02 | 1.850082e+02 | 2.459166e+00 | 1.616648e+04 | 5.027185e+00 | 1.492117e+04 | 2.359054e+01 | 1.702084e+01 | 1014.164390 | 7.728208e+00 | 7.164138e+01 |
| std | 2.156678e+03 | 3.336921e+04 | 0.0 | 0.0 | 8.917096e+00 | 6.860069e+00 | 1.724761e+01 | 5.171552e+00 | 1.195324e+01 | 5.737842e+02 | 9.230697e+01 | 2.147395e+00 | 9.282465e+03 | 1.249445e+00 | 3.879577e+03 | 6.059275e+00 | 6.206226e+00 | 4.059265 | 2.018170e+00 | 2.275361e+01 |
| min | 6.901500e+05 | 1.160000e+02 | 2019.0 | 8.0 | 1.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.455000e+01 | -1.242900e+02 | -1.300000e+01 | 3.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 0.000000e+00 | -4.000000e+01 | -3.720000e+01 | 960.500000 | 1.000000e+00 | 8.334298e-01 |
| 25% | 7.209280e+05 | 3.706000e+03 | 2019.0 | 8.0 | 8.000000e+00 | 5.000000e+00 | 2.300000e+01 | 3.396700e+01 | -9.801800e+01 | 1.010000e+02 | 1.200000e+02 | 0.000000e+00 | 3.048000e+03 | 5.000000e+00 | 1.609300e+04 | 1.960000e+01 | 1.380000e+01 | 1011.800000 | 5.000000e+00 | 5.579015e+01 |
| 50% | 7.227280e+05 | 1.386000e+04 | 2019.0 | 8.0 | 1.600000e+01 | 1.100000e+01 | 5.000000e+01 | 3.835000e+01 | -9.170800e+01 | 2.520000e+02 | 1.800000e+02 | 2.100000e+00 | 2.200000e+04 | 5.000000e+00 | 1.609300e+04 | 2.350000e+01 | 1.810000e+01 | 1014.100000 | 9.000000e+00 | 7.655374e+01 |
| 75% | 7.250900e+05 | 5.476700e+04 | 2019.0 | 8.0 | 2.400000e+01 | 1.700000e+01 | 5.500000e+01 | 4.194000e+01 | -8.298600e+01 | 4.000000e+02 | 2.600000e+02 | 3.600000e+00 | 2.200000e+04 | 5.000000e+00 | 1.609300e+04 | 2.780000e+01 | 2.170000e+01 | 1016.400000 | 9.000000e+00 | 9.062916e+01 |
| max | 7.268130e+05 | 9.499800e+04 | 2019.0 | 8.0 | 3.100000e+01 | 2.300000e+01 | 5.900000e+01 | 4.894100e+01 | -6.831300e+01 | 9.999000e+03 | 3.600000e+02 | 3.600000e+01 | 2.200000e+04 | 9.000000e+00 | 1.600000e+05 | 5.600000e+01 | 3.600000e+01 | 1059.900000 | 9.000000e+00 | 1.000000e+02 |
It is a little harder to quickly summarize missing data in pandas, but here’s a way to count missing observations and calculate the percent missing.
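One way to build that summary (sketched on a toy frame with some NaNs; with the real data, replace the toy met with the loaded DataFrame):

```python
import numpy as np
import pandas as pd

# Toy stand-in for met with some missing values
met = pd.DataFrame({"temp": [20.0, np.nan, 24.0, np.nan],
                    "lat": [34.3, 34.3, 34.3, 34.3]})

# Count and fraction of missing values per column, most-missing first
miss = pd.DataFrame({"n_missing": met.isna().sum(),
                     "pct_missing": met.isna().mean().round(4)})
miss = miss.sort_values("n_missing", ascending=False)
print(miss)
```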
| n_missing | pct_missing | |
|---|---|---|
| atm.press | 1666274 | 0.7009 |
| wind.dir | 785290 | 0.3303 |
| ceiling.ht | 121275 | 0.0510 |
| vis.dist | 80956 | 0.0341 |
| wind.sp | 79693 | 0.0335 |
| dew.point | 66288 | 0.0279 |
| rh | 66426 | 0.0279 |
| temp | 60089 | 0.0253 |
| lat | 0 | 0.0000 |
| day | 0 | 0.0000 |
| atm.press.qc | 0 | 0.0000 |
| year | 0 | 0.0000 |
| dew.point.qc | 0 | 0.0000 |
| month | 0 | 0.0000 |
| temp.qc | 0 | 0.0000 |
| vis.var.qc | 0 | 0.0000 |
| vis.var | 0 | 0.0000 |
| vis.dist.qc | 0 | 0.0000 |
| sky.cond | 0 | 0.0000 |
| lon | 0 | 0.0000 |
If we return to our initial question (which weather stations reported the hottest and coldest temperatures?), we should take a closer look at our key variable, temperature (temp).
count 2.317254e+06
mean 2.359054e+01
std 6.059275e+00
min -4.000000e+01
25% 1.960000e+01
50% 2.350000e+01
75% 2.780000e+01
max 5.600000e+01
Name: temp, dtype: float64
loc is an indexer for selecting rows and columns.

| USAFID | hour | lat | lon | elev | temp | wind.sp | |
|---|---|---|---|---|---|---|---|
| 496047 | 720717 | 0 | 29.117 | -89.55 | 36 | -40.0 | NaN |
| 496048 | 720717 | 0 | 29.117 | -89.55 | 36 | -40.0 | NaN |
| 496049 | 720717 | 0 | 29.117 | -89.55 | 36 | -40.0 | NaN |
| 496050 | 720717 | 1 | 29.117 | -89.55 | 36 | -40.0 | NaN |
| 496051 | 720717 | 1 | 29.117 | -89.55 | 36 | -40.0 | NaN |
We should check outside sources to make sure that our data make sense. The observation with -40°C is suspicious, so we should look up the location of the weather station.
Go to Google Maps and enter the coordinates for the site with -40°C (29.12, -89.55).

Filter with loc and sort the data with sort_values in one line:
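A sketch of that one-liner on toy data (the real call filters met on temp and keeps the 20 lowest values):

```python
import pandas as pd

# Toy stand-in for met (hypothetical station IDs and temps)
met = pd.DataFrame({"USAFID": [720717, 722817, 690150],
                    "temp": [-40.0, -17.2, 37.2]})

# Drop the suspicious -40 readings and sort coldest-first, in one line
coldest = met.loc[met["temp"] > -40].sort_values("temp").head(20)
print(coldest)
```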
| USAFID | WBAN | year | month | day | hour | min | lat | lon | elev | ... | vis.dist.qc | vis.var | vis.var.qc | temp | temp.qc | dew.point | dew.point.qc | atm.press | atm.press.qc | rh | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1203054 | 722817 | 3068 | 2019 | 8 | 1 | 1 | 56 | 38.767 | -104.300 | 1838 | ... | 9 | N | 5 | -17.2 | 5 | NaN | 9 | NaN | 9 | NaN |
| 1203127 | 722817 | 3068 | 2019 | 8 | 3 | 11 | 56 | 38.767 | -104.300 | 1838 | ... | 9 | N | 5 | -17.2 | 5 | NaN | 9 | NaN | 9 | NaN |
| 1203128 | 722817 | 3068 | 2019 | 8 | 3 | 12 | 56 | 38.767 | -104.300 | 1838 | ... | 9 | N | 5 | -17.2 | 5 | NaN | 9 | NaN | 9 | NaN |
| 1203221 | 722817 | 3068 | 2019 | 8 | 6 | 21 | 56 | 38.767 | -104.300 | 1838 | ... | 9 | N | 5 | -17.2 | 5 | NaN | 9 | NaN | 9 | NaN |
| 1203224 | 722817 | 3068 | 2019 | 8 | 6 | 22 | 56 | 38.767 | -104.300 | 1838 | ... | 9 | N | 5 | -17.2 | 5 | NaN | 9 | NaN | 9 | NaN |
| 1203052 | 722817 | 3068 | 2019 | 8 | 1 | 0 | 56 | 38.767 | -104.300 | 1838 | ... | 9 | N | 5 | -17.2 | 5 | NaN | 9 | NaN | 9 | NaN |
| 1203225 | 722817 | 3068 | 2019 | 8 | 6 | 23 | 3 | 38.767 | -104.300 | 1838 | ... | 9 | N | 5 | -17.0 | 6 | NaN | 9 | NaN | 9 | NaN |
| 1203220 | 722817 | 3068 | 2019 | 8 | 6 | 21 | 39 | 38.767 | -104.300 | 1838 | ... | 9 | N | 5 | -17.0 | 6 | NaN | 9 | NaN | 9 | NaN |
| 1203222 | 722817 | 3068 | 2019 | 8 | 6 | 22 | 32 | 38.767 | -104.300 | 1838 | ... | 9 | N | 5 | -17.0 | 6 | NaN | 9 | NaN | 9 | NaN |
| 1203223 | 722817 | 3068 | 2019 | 8 | 6 | 22 | 41 | 38.767 | -104.300 | 1838 | ... | 9 | N | 5 | -17.0 | 6 | NaN | 9 | NaN | 9 | NaN |
| 1203053 | 722817 | 3068 | 2019 | 8 | 1 | 1 | 6 | 38.767 | -104.300 | 1838 | ... | 9 | N | 5 | -17.0 | 5 | NaN | 9 | NaN | 9 | NaN |
| 1105458 | 722518 | 12974 | 2019 | 8 | 12 | 3 | 56 | 27.901 | -98.052 | 78 | ... | 9 | 9 | 9 | -17.0 | 1 | NaN | 9 | 1011.6 | 1 | NaN |
| 1105456 | 722518 | 12974 | 2019 | 8 | 12 | 1 | 56 | 27.901 | -98.052 | 78 | ... | 9 | 9 | 9 | -17.0 | 1 | NaN | 9 | 1010.3 | 1 | NaN |
| 1105454 | 722518 | 12974 | 2019 | 8 | 11 | 23 | 56 | 27.901 | -98.052 | 78 | ... | 9 | 9 | 9 | -17.0 | 1 | NaN | 9 | 1010.3 | 1 | NaN |
| 2370757 | 726764 | 94163 | 2019 | 8 | 27 | 11 | 50 | 44.683 | -111.116 | 2025 | ... | 5 | N | 5 | -3.0 | C | -5.0 | C | NaN | 9 | 86.265368 |
| 2370758 | 726764 | 94163 | 2019 | 8 | 27 | 12 | 10 | 44.683 | -111.116 | 2025 | ... | 5 | N | 5 | -3.0 | 5 | -4.0 | 5 | NaN | 9 | 92.910834 |
| 2370759 | 726764 | 94163 | 2019 | 8 | 27 | 12 | 30 | 44.683 | -111.116 | 2025 | ... | 5 | N | 5 | -3.0 | 5 | -4.0 | 5 | NaN | 9 | 92.910834 |
| 2370760 | 726764 | 94163 | 2019 | 8 | 27 | 12 | 50 | 44.683 | -111.116 | 2025 | ... | 5 | N | 5 | -3.0 | C | -4.0 | C | NaN | 9 | 92.910834 |
| 252488 | 720411 | 137 | 2019 | 8 | 18 | 12 | 35 | 36.422 | -105.290 | 2554 | ... | 5 | N | 5 | -2.4 | 5 | -3.7 | 5 | NaN | 9 | 90.914749 |
| 2370687 | 726764 | 94163 | 2019 | 8 | 26 | 12 | 30 | 44.683 | -111.116 | 2025 | ... | 5 | N | 5 | -2.0 | 5 | -3.0 | 5 | NaN | 9 | 92.966900 |
20 rows × 30 columns
Back to question: what weather stations reported the hottest and coldest daily temperatures?
Compute the min and max hourly temperatures and the associated stations. Filter with loc and find the min and max (which ignore missing values) with idxmin() and idxmax().
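A sketch on toy data (with the real met, the same pattern returns the rows shown below):

```python
import pandas as pd

# Toy stand-in for met (hypothetical values)
met = pd.DataFrame({"USAFID": [722817, 720267, 690150],
                    "day": [1, 26, 15],
                    "temp": [-17.2, 56.0, 30.0]})

cols = ["temp", "USAFID", "day"]
# idxmin/idxmax return the row label of the extreme value, skipping NaN
coldest = met.loc[met["temp"].idxmin(), cols]
hottest = met.loc[met["temp"].idxmax(), cols]
print(coldest, hottest)
```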
(temp -17.2
USAFID 722817
day 1
Name: 1203054, dtype: object,
temp 56.0
USAFID 720267
day 26
Name: 42402, dtype: object)
We need to transform our data to answer our initial question at the daily scale. Let’s find daily mean temperature by station and day. Use groupby and agg to calculate the mean and create a new dataset met_daily.
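A sketch of the groupby/agg step on toy hourly data (the real call groups the full met by station and day):

```python
import pandas as pd

# Toy stand-in for met: two stations, hourly rows on one day
met = pd.DataFrame({"USAFID": [1, 1, 2, 2],
                    "day": [1, 1, 1, 1],
                    "temp": [20.0, 22.0, 30.0, 32.0],
                    "lat": [34.3, 34.3, 38.7, 38.7],
                    "lon": [-116.2, -116.2, -104.3, -104.3],
                    "rh": [50.0, 55.0, 20.0, 25.0]})

# Daily mean per station-day; lat/lon are constant per station, so take "first"
met_daily = (met.groupby(["USAFID", "day"], as_index=False)
                .agg({"temp": "mean", "lat": "first",
                      "lon": "first", "rh": "mean"}))
print(met_daily)
```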
( USAFID day temp lat lon rh
20774 722817 3 -17.200000 38.767 -104.3 NaN
20773 722817 1 -17.133333 38.767 -104.3 NaN
20775 722817 6 -17.066667 38.767 -104.3 NaN
43605 726130 11 4.278261 44.270 -71.3 99.540440
43625 726130 31 4.304348 44.270 -71.3 99.710254,
USAFID day temp lat lon rh
27043 723805 5 40.975000 34.768000 -114.618000 11.369535
2345 720339 14 41.000000 32.146000 -111.171000 9.137857
27042 723805 4 41.183333 34.768000 -114.618000 14.211822
20623 722787 5 41.357143 33.527000 -112.295000 19.829822
30 690150 31 41.716667 34.299667 -116.165667 8.676717)
What day of the month was the hottest?
(temp -17.2
USAFID 722817.0
day 3.0
Name: 20774, dtype: float64,
temp 41.716667
USAFID 690150.000000
day 31.000000
Name: 30, dtype: float64)
To answer the last question about covariation between temperature and humidity, we can first calculate the correlation.
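Which exact statistics produced the three numbers below isn't shown; a minimal pandas sketch on toy data (with the real data you would pass met["temp"] and met["rh"]):

```python
import pandas as pd

# Toy stand-in: temp and rh move in opposite directions
df = pd.DataFrame({"temp": [20.0, 25.0, 30.0, 35.0],
                   "rh": [80.0, 60.0, 50.0, 30.0]})

r_pearson = df["temp"].corr(df["rh"])                      # Pearson (default)
r_spearman = df["temp"].corr(df["rh"], method="spearman")  # rank-based
print(r_pearson, r_spearman)
```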
(np.float64(-0.5464705142984516),
np.float64(-0.5461260667477443),
np.float64(0.00034444755070728306))
Is it the same for the daily data?
(np.float64(-0.25729536769933825),
np.float64(-0.20801597675598085),
np.float64(0.0492793909433574))
The correlation is quite different between hourly and daily data. Why?
With exploratory graphs we aim to:
• debug any issues remaining in the data
• understand properties of the data
• look for patterns in the data
• inform modeling strategies
Exploratory graphs do not need to be perfect; later we’ll look at presentation-ready plots with plotnine, Python’s ggplot2 equivalent.
What types of data are these exploratory graphs good for?
• histograms
• boxplots
• scatterplots
• barplots
• lineplots
• violin plots
• maps
Focusing on temperature, let’s look at the distribution of hourly temperature (after removing -40°C) using a histogram.
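A sketch of that histogram, using simulated temperatures as a stand-in for met["temp"] (with the real data, replace temp with the loaded column):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in for met["temp"], including one bogus -40 reading
rng = np.random.default_rng(1)
temp = pd.Series(np.append(rng.normal(23.6, 6.1, 1000), -40.0))

fig, ax = plt.subplots()
ax.hist(temp[temp > -40], bins=30)  # drop the -40 artifact before plotting
ax.set_xlabel("Temperature (°C)")
ax.set_ylabel("Count")
plt.show()
```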
Let’s look at the daily data.
A boxplot gives us an idea of the quantiles of the distribution and any outliers.
Often boxplots are better for grouped continuous data, such as temperature by hour (over all days in the dataset).
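A sketch of a temperature-by-hour boxplot, using simulated data as a stand-in for met:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in: hourly temps that warm through the afternoon, plus noise
rng = np.random.default_rng(0)
met = pd.DataFrame({"hour": np.repeat(np.arange(24), 50)})
met["temp"] = (20 + 5 * np.sin((met["hour"] - 6) / 24 * 2 * np.pi)
               + rng.normal(0, 2, len(met)))

# One box per hour of the day
fig, ax = plt.subplots()
met.boxplot(column="temp", by="hour", ax=ax)
ax.set_xlabel("Hour of day")
ax.set_ylabel("Temperature (°C)")
plt.show()
```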
Often violin plots are helpful for grouped continuous data (they show the shape of the distribution), such as temperature by hour (over all days in the dataset).
Let’s check covariation with a scatterplot of humidity vs temperature.
Since there are so many points let’s take a random sample of 10,000 rows using sample and add alpha transparency to see the scatter better. random_state is like setting a seed for reproducibility.
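A sketch of the sampled scatterplot, with simulated temp/rh standing in for the real columns:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in for met: humidity falls as temperature rises, plus noise
rng = np.random.default_rng(2)
met = pd.DataFrame({"temp": rng.normal(24, 6, 50_000)})
met["rh"] = np.clip(100 - 2 * met["temp"] + rng.normal(0, 15, len(met)), 0, 100)

# Sample 10,000 rows; random_state makes the draw reproducible
samp = met.sample(10_000, random_state=1)
fig, ax = plt.subplots()
ax.scatter(samp["temp"], samp["rh"], alpha=0.1, s=5)  # alpha reveals overplotting
ax.set_xlabel("Temperature (°C)")
ax.set_ylabel("Relative humidity (%)")
plt.show()
```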

You may need to run python3 -m pip install geopandas contextily to make these maps.
import geopandas as gpd
import contextily as cx
import matplotlib.pyplot as plt
# extract latitudes and longitudes, dropping rows where either is missing
pts = met_daily.dropna(subset=["lat", "lon"]).copy()
gdf = gpd.GeoDataFrame(
    pts,
    geometry=gpd.points_from_xy(pts["lon"], pts["lat"]),
    crs="EPSG:4326",
)
gdf_web = gdf.to_crs(epsg=3857)  # project to Web Mercator to match the basemap tiles
ax = gdf_web.plot(markersize=10, alpha=0.6)
cx.add_basemap(ax, source=cx.providers.CartoDB.Positron)
ax.set_axis_off()
plt.show()

folium (Leaflet maps for Python) is another (my preferred) option, but it seems to have some problems in quarto. This is how folium works:
Create the base map with folium.Map, then add each station with folium.CircleMarker (adjust opacity and radius for point size).

import folium
pts = met_daily[["lat", "lon"]].dropna().drop_duplicates()
m = folium.Map(
    location=[pts["lat"].mean(), pts["lon"].mean()],
    zoom_start=4,
    tiles="CartoDB positron",
)
for _, r in pts.iterrows():
    folium.CircleMarker(
        location=[r["lat"], r["lon"]],
        radius=2,
        fill=True,
        fill_opacity=0.8,
        opacity=0.8,
    ).add_to(m)
m

Let’s now look at where the hottest and coldest places were based on the daily temperatures. We add a layer for cold and a layer for hot.
cold = met_daily.loc[met_daily["temp"].idxmin()]
hot = met_daily.loc[met_daily["temp"].idxmax()]
cold, hot
center = [met_daily["lat"].mean(), met_daily["lon"].mean()]
m = folium.Map(location=center, zoom_start=4, tiles="CartoDB positron")
# Coldest (blue)
folium.CircleMarker(
    location=[cold["lat"], cold["lon"]],
    radius=10,
    popup=f"Coldest daily mean: {cold['temp']:.2f}°C<br>USAFID={cold['USAFID']}<br>day={cold['day']}",
    color="blue",
    fill=True,
    fill_color="blue",
    fill_opacity=0.9,
).add_to(m)
# Hottest (red)
folium.CircleMarker(
    location=[hot["lat"], hot["lon"]],
    radius=10,
    popup=f"Hottest daily mean: {hot['temp']:.2f}°C<br>USAFID={hot['USAFID']}<br>day={hot['day']}",
    color="red",
    fill=True,
    fill_color="red",
    fill_opacity=0.9,
).add_to(m)
m

Finally, let’s quantify how rh changes with temp.

plotnine (ggplot-style plots in Python)

If you like ggplot2, plotnine uses the same “add layers with +” workflow:
• map variables with aes()
• add geoms (points, smoothers)
• add labels/themes
import numpy as np
from sklearn.metrics import r2_score
from plotnine import (
ggplot, aes, geom_point, geom_smooth, labs, theme_bw, annotate, coord_cartesian
)
met_s = met[["temp", "rh"]].dropna().sample(min(10_000, len(met)), random_state=1)
# Fit line: rh = m*temp + b
x = met_s["temp"].to_numpy()
y = met_s["rh"].to_numpy()
m, b = np.polyfit(x, y, 1)
y_hat = m * x + b
r2 = r2_score(y, y_hat)
label = f"rh = {m:.2f}·temp + {b:.2f}\nR² = {r2:.3f}"
# location for the label
x_pos = np.quantile(x, 0.9)
y_pos = np.quantile(y, 0.999)
(
ggplot(met_s, aes(x="temp", y="rh"))
+ geom_point(alpha=0.2, size=1.0)
+ geom_smooth(method="lm", se=False)
+ annotate("text", x=x_pos, y=y_pos, label=label, ha="left")
+ coord_cartesian(ylim=(0, 100))
+ labs(
title="Temperature vs humidity (sample) with linear fit",
x="Temperature (°C)",
y="Relative humidity (%)"
)
+ theme_bw()
)